Abstract

Background: The assignment of DNA samples to coarse population groups can be a useful but difficult task. One 
such example is the inference of coarse ethnic groupings for forensic applications. Ethnicity plays an important role 
in forensic investigation and can be inferred with the help of genetic markers. Being maternally inherited, present in high
copy number, and persistent in degraded samples, mitochondrial DNA may be useful for inferring coarse
ethnicity. In this study, we compare the performance of methods for inferring ethnicity from the sequence of the 
hypervariable region of the mitochondrial genome. 

Results: We present the results of comprehensive experiments conducted on datasets extracted from the mtDNA 
population database, showing that ethnicity inference based on support vector machines (SVM) achieves an overall 
accuracy of 80-90%, consistently outperforming nearest neighbor and discriminant analysis methods previously 
proposed in the literature. We also evaluate methods of handling missing data and characterize the most 
informative segments of the hypervariable region of the mitochondrial genome. 

Conclusions: Support vector machines can be used to infer coarse ethnicity from a small region of mitochondrial 
DNA sequence with surprisingly high accuracy. In the presence of missing data, utilizing only the regions common 
to the training sequences and a test sequence proves to be the best strategy. Given these results, SVM algorithms 
are likely to also be useful in other DNA sequence classification applications. 



Introduction 

Human ethnic identity is a controversial and complex 
topic. Each human individual is a complex mosaic of 
genetic material originating from a multitude of ances- 
tral sources. However, despite this complexity, the divi- 
sion of humans into coarse ethnic groupings can greatly 
assist forensic investigators and is also increasingly 
being used as a predictor of drug effectiveness in the 
emerging fields of personalized medicine and race-based 
therapeutics. Self-reported and investigator-assigned eth- 
nicity typically rely on the subjective interpretation of a 
complex combination of both genetic and non-genetic 
information including behavior, cultural and societal
norms, skin color, and other influences. For this reason, 
attempts to accurately infer probable coarse ethnic identity
can be difficult in contexts with limited access to
the most informative markers, such as skin and hair samples.
In these situations genetic information can be
extremely valuable to forensic pursuits, significantly
enhancing the accuracy of coarse ethnic classification.

Several approaches to genetic-based inference of eth- 
nicity have been proposed in the literature. In particular, 
the use of panels of autosomal markers has been
shown to provide excellent accuracy for assigning sam- 
ples to specific clades [1,2]. Unfortunately, these 
approaches rely on typing large numbers of autosomal 
loci that may not survive long periods of degradation. 
Mitochondrial DNA, however, due to its high-copy 
number, is recoverable even from minute or highly 
degraded samples. Furthermore, due to its high polymorphism
and maternal inheritance, mitochondrial
DNA has proved to be an excellent marker for the infer- 
ence of ethnic affiliation. Indeed, several studies includ- 
ing [3-5] have previously shown the feasibility of 
inferring the probable ethnicity and/or geographic origin 
from the sequence of the hypervariable region (HVR) of 
the mitochondrial genome. These studies clearly demon- 
strate that, although the mitochondrial sequence alone 
does not by itself determine one's ethnicity, the two are 
nevertheless strongly associated. 

In this paper we test the utility and robustness of sev- 
eral methods for the classification of HVR mitochondrial 
sequences into coarse ethnic groups as previously 
assigned by investigators from the FBI, self-assigned by 
study subjects, or by anthropologists. The goal was to 
identify a method that could most accurately reproduce 
these classifications using only a small region of the 
mitochondrial genome. Like Egeland et al. [5], we consider
a supervised learning approach to ethnicity infer-
ence. In this setting, mtDNA sequences with annotated 
ethnicity are used to "train" a classification function that 
is then used to assign ethnicities to new mtDNA 
sequences. Adopting this approach allows us to draw on 
the large body of knowledge developed within the 
machine learning community (see, e.g., [6]). The main 
goal of the paper is to assess the performance of four 
well-known classification algorithms (support vector 
machines, linear discriminant analysis, quadratic discri- 
minant analysis, and nearest neighbor) on a variety of 
benchmark datasets including realistic levels of missing 
data and training data bias. 

Comprehensive experiments conducted on mtDNA 
profiles extracted from the mtDNA population database 
[7] show that the support vector machine algorithm is
the most accurate of the compared methods, outperforming
both the discriminant analysis methods previously employed
in [3-5] and a nearest neighbor algorithm similar
to that used for haplogroup inference in [8]. In both
cross-validation and experiments conducted on indepen- 
dently collected training and test data, SVM achieves an 
overall accuracy of 80-90%, matching the accuracy of 
human experts making ethnicity assignments based on 
physical measurements of the skull and large bones 
[9,10], and coming close to the accuracy achieved by 
using approximately sixty autosomal loci [11]. These 
results demonstrate that SVM effectively classifies 
sequences from a small segment of the mitochondrial 
genome and that these classifications can be used to 
predict the probable assignment of coarse ethnicity with 
reasonable accuracy. The superiority of SVM in this 
classification problem suggests that it is also likely to be 
superior in similar sequence classification applications. 



Methods 

In this section, we introduce the four methods of ethni- 
city assignment investigated in this study and the data- 
sets used to evaluate their empirical performance. We 
begin by briefly introducing principal component analy- 
sis (PCA), a dimensionality reduction technique used as 
a preprocessing step for three of the four methods. We 
then describe the four classification algorithms - sup- 
port vector machines (SVM), linear discriminant analysis 
(LDA), quadratic discriminant analysis (QDA) and 1-nearest
neighbor (1NN). Finally, we describe the datasets
used for evaluation, the conversion of mtDNA
sequence profiles into feature vectors, and methods of 
encoding sequences with missing regions. 

Principal component analysis 

PCA (see [6] for an introduction) is a factor analysis 
technique of dimensionality reduction. Given m samples 
over n variables, the m samples can be represented as a 
$m \times n$ matrix $X$. We further assume that the sample
mean of each variable is 0, that is, $\sum_{i=1}^{m} X_{ij} = 0$ for
every $j$. Projecting the $m$ samples onto $n$ new axes yields
another $m \times n$ matrix $Y = XP$, where $P$ is an $n \times n$
orthogonal matrix whose columns are unit vectors
defining the $n$ new axes. PCA finds a $P$ such that the
sample covariance matrix of the $n$ new variables is a
diagonal matrix, that is,

$$\Sigma_Y = \frac{1}{m} Y^T Y = \frac{1}{m} (XP)^T XP = P^T \Sigma_X P = D, \qquad (1)$$

where $D$ is a diagonal matrix, and $\Sigma_X$ and $\Sigma_Y$ are the
sample covariance matrices of the original and new variables,
respectively. The orthogonal matrix $P$ can be
easily obtained by eigenvalue decomposition of $\Sigma_X$. PCA
is a dimensionality reduction technique in that only k of 
the n new variables are kept for further analysis. A stan- 
dard approach is to pick the k variables with the largest 
sample variances. Therefore, all we need to do is to pick 
the value of k. Fortunately, when PCA is used in con- 
junction with supervised learning algorithms like classifi- 
cation algorithms, the best value of k can be selected by 
performing cross-validation. In this study, k was selected 
by performing 5-fold cross-validation (CV) on the train- 
ing data for each combination of dataset and classifica- 
tion algorithm. 
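
As a brief worked step, the diagonalization that PCA relies on can be written out explicitly. This is only a sketch in the notation of this section: $\Sigma_X$ denotes the sample covariance matrix of the original variables, and the symbol $P_k$ (the matrix formed by the $k$ columns of $P$ with the largest sample variances) is introduced here for illustration and does not appear in the original text.

```latex
% Eigenvalue decomposition of the symmetric matrix \Sigma_X, with P orthogonal (P^T P = I):
\Sigma_X = P D P^{T}
\quad\Longrightarrow\quad
P^{T} \Sigma_X P = (P^{T} P)\, D\, (P^{T} P) = D .
% Keeping only k of the n new variables:
Y_k = X P_k , \qquad P_k \in \mathbb{R}^{n \times k} .
```

The diagonal entries of $D$ are the sample variances of the new variables, which is why selecting the $k$ largest entries corresponds to keeping the $k$ highest-variance directions.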

Classification algorithms 
Support vector machines 

The SVM [12] is a binary classification algorithm. In the 
case of perfectly separable classes, SVM seeks a separating 
hyperplane with maximum margin, while for non-separ- 
able classes the goal is to maximize a linear combination 
of the separation margin and the total amount by which 
SVM predictions fall on the wrong side of their margin.
Given $k$-element feature vectors $x_i$, $i = 1, \ldots, m$, and an $m$-element
label vector $y$ such that $y_i \in \{1, -1\}$, this amounts
to solving the following optimization problem:

$$\min_{\beta, \beta_0, \xi} \; \frac{1}{2} \beta^T \beta + C \sum_{i=1}^{m} \xi_i \quad \text{subject to} \quad y_i \left( \beta^T \Phi(x_i) + \beta_0 \right) \ge 1 - \xi_i, \;\; \xi_i \ge 0, \;\; i = 1, \ldots, m, \qquad (2)$$

where $C > 0$ is a penalty constant, $\xi_i$ is the slack variable
allowing misclassification of sample $i$, $\Phi(\cdot)$ is a
function that maps $x_i$ to a high-dimensional space, often
called the feature space, and $\beta$, $\beta_0$ define the optimum
separating hyperplane $\beta^T z + \beta_0 = 0$ in feature space.
Once the optimal separating hyperplane is found, a test
sample $t$ is classified according to the sign of $\beta^T \Phi(t) + \beta_0$.

In practice, the solution to the convex optimization
problem (2) is obtained by solving the so-called Wolfe
dual. Instead of explicitly mapping samples to the feature
space, solving the dual requires only a kernel function
$K(x_1, x_2) = \Phi(x_1)^T \Phi(x_2)$, which implicitly maps
samples to the feature space and simultaneously computes
the inner product [12]. In this study, we used the
software package LIBSVM [13] to conduct all SVM
experiments. LIBSVM uses the "one-against-one"
approach [14] when more than two classes are present.
For all SVM experiments we used the radial basis kernel
$K(x_1, x_2) = \exp(-\gamma \|x_1 - x_2\|^2)$, where $\gamma$ is a parameter. The
penalty constant $C$ and the parameter $\gamma$ were tuned
using 5-fold cross-validation on the training data.
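
To make the role of the kernel concrete, the following minimal Python sketch evaluates the radial basis kernel and a kernel-form decision value for a two-class problem. It is not the LIBSVM implementation used in this study; the function names `rbf_kernel` and `decision_value`, and the idea of passing precomputed support vectors and signed coefficients, are assumptions made purely for illustration.

```python
import math


def rbf_kernel(x1, x2, gamma):
    """Radial basis kernel exp(-gamma * ||x1 - x2||^2) for two equal-length vectors."""
    if len(x1) != len(x2):
        raise ValueError("feature vectors must have the same length")
    squared_distance = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return math.exp(-gamma * squared_distance)


def decision_value(support_vectors, coefficients, bias, t, gamma):
    """Kernel-form decision value for test sample t (hypothetical helper).

    `coefficients` are assumed to be signed weights (label times dual
    variable) produced by some SVM solver; they are inputs here, not
    something this sketch computes.
    """
    weighted = (c * rbf_kernel(s, t, gamma) for s, c in zip(support_vectors, coefficients))
    return sum(weighted) + bias


def classify(support_vectors, coefficients, bias, t, gamma):
    """Assign label 1 or -1 according to the sign of the decision value."""
    return 1 if decision_value(support_vectors, coefficients, bias, t, gamma) >= 0 else -1
```

Larger values of the kernel parameter make the kernel decay faster with distance, which is one reason it is tuned jointly with the penalty constant by cross-validation.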
Linear and quadratic discriminant analysis 
LDA and QDA assume that for each class the feature 
vectors follow a multivariate normal distribution [6]. 
That is, the conditional probability of a sample $x$ given
that it belongs to class $g$ is given by

$$f_g(x) = \Pr(X = x \mid G = g) = \frac{1}{|2\pi\Sigma_g|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_g)^T \Sigma_g^{-1} (x - \mu_g) \right). \qquad (3)$$

By applying Bayes' theorem, we obtain the posterior
distribution as follows:

$$\Pr(G = g \mid X = x) = \frac{f_g(x)\, \pi_g}{\sum_{l} f_l(x)\, \pi_l}, \qquad (4)$$

where $\pi_g$ is the prior probability of class $g$. The parameters
$\mu_g$ and $\Sigma_g$ of the multivariate normal distribution are estimated
using the training dataset. LDA assumes that the
classes have a common covariance matrix (i.e., $\Sigma_g = \Sigma$
for every $g$); therefore fewer parameters need to be estimated
for LDA compared to QDA. For both methods, a
given test sample $t$ is assigned to the class with the
highest posterior probability, $\arg\max_g \Pr(G = g \mid X = t)$.

In this study, we used MCLUST Version 3 [15] to 
conduct all LDA and QDA experiments. 
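
The Bayes-rule classification step can be illustrated with a deliberately simplified Python sketch. It uses a single feature (so each class has a scalar mean and variance rather than a mean vector and covariance matrix) and is not the MCLUST code used in this study; the function names and the dictionary layout of `params` and `priors` are assumptions for illustration only.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def posteriors(x, params, priors):
    """Posterior probability of each class given x.

    params maps class -> (mean, variance); priors maps class -> prior probability.
    """
    joint = {g: normal_pdf(x, mean, var) * priors[g] for g, (mean, var) in params.items()}
    total = sum(joint.values())
    return {g: value / total for g, value in joint.items()}


def assign_class(x, params, priors):
    """Return the class with the highest posterior probability."""
    post = posteriors(x, params, priors)
    return max(post, key=post.get)
```

In this one-dimensional setting, giving every class the same variance corresponds to the LDA assumption, while allowing class-specific variances corresponds to QDA.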
1-nearest neighbor (1NN)

1NN is a simple non-parametric classification algorithm,
which does not have a training process. Given a set of
reference samples and a test sample, 1NN searches the
reference dataset for the sample nearest to the test sample
and assigns the test sample to the class to which the
nearest sample belongs. In case there are multiple nearest
reference samples, voting is used to assign the test
sample to the class containing the largest number of
nearest reference samples. As discussed below, mtDNA
profiles are encoded into binary feature vectors. We
used the number of mismatch positions (a.k.a. the Hamming
distance) to measure the distance between samples,
and did not apply PCA to the data before applying
1NN.
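
The nearest neighbor rule with Hamming distance and voting among equally near references can be sketched in a few lines of Python. This is an illustrative sketch rather than the code used in the study; the function names are hypothetical, and the handling of a tied vote (the class encountered first wins) is an assumption, since the text does not specify it.

```python
from collections import Counter


def hamming(u, v):
    """Number of positions at which two equal-length binary vectors differ."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a != b for a, b in zip(u, v))


def classify_1nn(references, labels, test):
    """Assign `test` to the class most common among its nearest references."""
    distances = [hamming(r, test) for r in references]
    nearest = min(distances)
    # Vote only among the references that achieve the minimum distance.
    votes = Counter(label for d, label in zip(distances, labels) if d == nearest)
    # Assumption: if the vote itself is tied, the class seen first is returned.
    return votes.most_common(1)[0][0]
```

Because there is no training step, each prediction costs one pass over the reference set.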

Datasets 

We used the forensic and published tables in the 
mtDNA population database [7] to empirically evaluate 
the performance of the four algorithms for ethnicity 
assignment. The forensic table contains 4,839 samples 
collected and typed by the Federal Bureau of Investiga- 
tion (FBI), while the published table contains 6,106 sam- 
ples collected from the literature. 

In this study, we focus only on the samples annotated 
as belonging to one of the four coarse ethnic groups - 
Caucasian, African, Asian and Hispanic. Filtering the 
forensic and published tables by this criterion results in
4,426 and 3,976 samples, respectively. In the rest of the 
paper we will refer to the two filtered tables simply as 
the forensic and published datasets. The forensic dataset 
contains 1,674 Caucasian (37.8%), 1,305 African (29.5%), 
761 Asian (17.2%) and 686 Hispanic (15.5%) samples, 
while the published dataset is comprised of 2,807 Cau- 
casian (70.6%), 254 African (6.4%) and 915 Asian (23%) 
samples. 

Additional file 1 shows the percentage of samples 
sequenced at each position for the forensic and pub- 
lished datasets. We note that the forensic dataset has a 
significantly better coverage than the published dataset. 
All the samples in the forensic dataset cover portions of 
both hypervariable region 1 (HVR1) and hypervariable 
region 2 (HVR2) of mtDNA, whereas over 60% of sam- 
ples in the published dataset do not cover HVR2 and 
around 5% of them do not cover HVR1. 

To better characterize and compare the forensic and 
published datasets, we assign each sample in the two 
datasets to one of the 23 basal haplogroups defined in 
[8]. Haplogroup assignment was performed using the 
unweighted 1NN algorithm described in [8] along with
the Genographic Project open resource mitochondrial 
DNA database (the consented database) of 21,164 sam- 
ples [16]. Behar et al. [8] reported a leave-one-out cross- 
validation accuracy of 96.72% on a reference database of 
16,609 samples. We observed a comparable accuracy of 
96.51% on the consented database. Therefore, we expect 
the inferred haplogroups of samples in the forensic and 
published datasets to have a similarly high accuracy. 
The ethnicity composition of each haplogroup and the 
inferred haplogroup composition of each broad ethnic 
group represented in the forensic and published datasets 
are given in Additional file 2. Additional file 2(A) sup- 
ports the well known fact that many haplogroups are 
strongly associated with a specific ancestry. For example, 
most samples with inferred haplogroup H, J, K, R0*, T,
U*, and V are Caucasian, most samples with inferred 
haplogroup B, D, M, N, and R9 are Asian, and most 
samples with inferred haplogroup L are African. How- 
ever, the association is not perfect, and significant per- 
centages of these haplogroups are present in other 
ethnic groups. For some haplogroups, such as B, N1*,
W, and X the association with ethnicity is particularly 
weak, with two or three ethnicities being represented in 
almost equal proportions. Additional file 2 further 
shows that the forensic and published datasets have sig- 
nificant differences in their ethnic and haplogroup com- 
positions. Most strikingly, Caucasians are significantly 
over-represented and Hispanics are completely missing 
from the published dataset. Such differences are most
likely due to the procedure used to assemble the published
dataset, and reflect preferential use of samples
from some ethnic groups in published studies.

For some of the experiments described in the Results 
section, we used specific subsets of the forensic and 
published datasets. The full-length forensic dataset consists
of the 1,904 samples typed for the most extensive
ranges of HVR1 (16024-16569) and HVR2 (1-576). 
This dataset is comprised of 222 Caucasian (11.7%), 820 
African (43.1%), 415 Asian (21.8%) and 447 Hispanic 
(23.5%) samples. The trimmed forensic dataset was pro- 
duced by trimming the samples in the forensic dataset 
such that only the region of 16024-16365 in HVR1 is 
kept. It has the same ethnicity composition as the foren- 
sic dataset since all samples in the forensic dataset are 
typed in this range. The trimmed published dataset was
created in a similar fashion, except that only 2,540 sam- 
ples covering the 16024-16365 region were kept. This 
subset contains 1,956 Caucasian (77%), 134 African 
(5.3%) and 450 Asian (17.7%) samples. 

Encoding mtDNA profiles into feature vectors 

Each sample in the forensic and published datasets is
given as a list of polymorphic changes when compared
to the revised Cambridge Reference Sequence (rCRS).
For example, 16298C denotes a substitution at position
16298 and 16124.1C denotes the insertion of a C after
position 16124. For a fixed dataset, we represent each
sample as an n-element binary vector, where n is the
number of unique polymorphisms present in the data- 
set. An element in the binary vector of a sample is set 
to 1 if the sample harbors the corresponding poly- 
morphism, and to 0 otherwise. This encoding method 
works well when all the samples in the dataset are 
sequenced over the same or very similar ranges. An 
example is the forensic dataset, in which all samples 
cover range 16024-16365 of HVR1 and range 73-340 of 
HVR2. While most of our results were obtained
using the above binary encoding, we also discuss and 
evaluate in the Results section several alternative 
schemes for encoding mtDNA profiles with significant 
amounts of missing data. 
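
The binary encoding described above can be sketched as follows. This is a minimal illustration, not the study's code: profiles are assumed to be lists of polymorphism strings such as "16298C", the function names are hypothetical, and ordering the features by sorting the strings is an arbitrary choice made here.

```python
def build_feature_index(profiles):
    """Map each unique polymorphism in the dataset to a vector position."""
    unique = sorted({poly for profile in profiles for poly in profile})
    return {poly: i for i, poly in enumerate(unique)}


def encode(profile, index):
    """Encode one profile as a binary vector over the dataset's polymorphisms."""
    vector = [0] * len(index)
    for poly in profile:
        # Assumption: polymorphisms absent from the index (e.g. seen only in
        # a new test sample) are ignored rather than raising an error.
        if poly in index:
            vector[index[poly]] = 1
    return vector
```

A profile identical to the rCRS over the typed range has an empty list of changes and therefore maps to the all-zero vector.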

Results 

Comparison of the four classification algorithms 

For an initial evaluation of the four classification algo- 
rithms, we performed cross-validation (CV) analysis 
using the trimmed forensic dataset. Cross-validation is 
one of the simplest and most widely used methods for 
estimating the accuracy of classification algorithms. 
Briefly, available samples are randomly split into K 
roughly equal parts, and then each part is used to evalu- 
ate classification accuracy of a model trained on the 
remaining K - 1 parts. In our experiments we used K = 
5, i.e., 5-fold cross-validation. 
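
The cross-validation procedure can be sketched in Python as below. This is a generic illustration under stated assumptions, not the evaluation code used in the study: `train_and_predict` is a hypothetical callback that takes training samples and labels and returns a prediction function, and the seeded shuffle is one of many reasonable ways to form random folds.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Randomly split sample indices into k roughly equal parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(samples, labels, k, train_and_predict, seed=0):
    """Fraction of samples labeled correctly when each fold is held out once."""
    correct = 0
    for held_out in k_fold_indices(len(samples), k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(samples)) if i not in held]
        # Train on the remaining K - 1 parts, then evaluate on the held-out part.
        predict = train_and_predict(
            [samples[i] for i in train_idx], [labels[i] for i in train_idx]
        )
        correct += sum(predict(samples[i]) == labels[i] for i in held_out)
    return correct / len(samples)
```

Every sample is used for testing exactly once, so the returned value is an accuracy over the whole dataset.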

In addition to ethnicity-wise average accuracies, we 
also use micro- and macro-accuracy as measures of the 
overall performance of the classification algorithms. 
These metrics, similar to the micro-average and macro- 
average of [17], are defined as follows: 

$$\text{Micro-Accuracy} = \frac{\sum_{i=1}^{K} C_i}{\sum_{i=1}^{K} N_i}, \qquad (5)$$

$$\text{Macro-Accuracy} = \frac{1}{K} \sum_{i=1}^{K} \frac{C_i}{N_i}, \qquad (6)$$

where $K$ is the number of classes in the dataset, $N_i$ is
the number of samples in class $i$ and $C_i$ is the number
of samples correctly labeled by the classifier in class $i$.
Note that micro- and macro-accuracy become the same
when class sizes are balanced, i.e., $N_1 = N_2 = \ldots = N_K$.
For imbalanced class sizes, micro-accuracy tends to 
over-emphasize the performance on the largest classes 
compared to macro-accuracy, which gives equal weight 
to the accuracy achieved for each class. 
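
The difference between the two measures is easy to see in a small Python sketch. The function names are hypothetical, and the counts used in the usage note are made-up toy numbers, not results from this study.

```python
def micro_accuracy(correct, totals):
    """Correctly labeled samples over all samples, pooled across classes."""
    return sum(correct) / sum(totals)


def macro_accuracy(correct, totals):
    """Unweighted mean of the per-class accuracies."""
    return sum(c / n for c, n in zip(correct, totals)) / len(totals)
```

For example, with toy counts `correct = [90, 10]` and `totals = [100, 20]`, the large class dominates the micro-accuracy (100/120, about 0.83), whereas the macro-accuracy averages 0.9 and 0.5 to give 0.7.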

Table 1 summarizes the 5-fold CV accuracy metrics
for PCA-QDA, PCA-LDA, 1NN, and PCA-SVM on the
trimmed forensic dataset. PCA-SVM consistently out- 
performs the other three classification algorithms with 
respect to all accuracy measures. Since the performance 
of different classification algorithms may depend signifi- 
cantly on the typed mtDNA region, we conducted three 
additional experiments to assess its effect on the classifi- 
cation accuracy of the four compared algorithms. In all 
three of them we started from the full-length forensic
dataset. In the first experiment, we iteratively deleted 
10% of the polymorphisms, starting from the HVR2 end 
non-adjacent to HVR1. Similarly, in the second experi- 
ment, we iteratively deleted 10% of the polymorphisms 
starting from the HVR1 end non-adjacent to HVR2. 
Finally, in the third experiment, we used a sliding win- 
dow approach to generate 20 different datasets, each of 
which retained 10% of the nucleotides from the
full-length forensic profiles.

Figure 1 gives the 5-fold CV micro-accuracy achieved 
by PCA-QDA, PCA-LDA, 1NN, and PCA-SVM in these
three experiments. Again, PCA-SVM consistently outperforms
the other three classification algorithms investigated
in this study. PCA-QDA is typically
outperformed by the other methods, except that it outperforms
1NN when the entire HVR is used. 1NN and
PCA-LDA have comparable performance, but PCA-LDA
performs slightly better than 1NN for near-complete
mtDNA profiles. Conversely, 1NN performs better than
PCA-LDA for some short typed regions. Indeed, for
short windows consisting of only 10% of the nucleotides
in the entire dataset, the performance of 1NN is often
as good as that of PCA-SVM; see Figure 1(C).

Figure 1(C) further shows that, regardless of the classi- 
fication method used, certain regions of HVR1 and 
HVR2 are more informative than others for the purpose 
of ethnicity inference. Additional file 3 gives the 5-fold 
CV micro-accuracy for 6 selected windows of 165- 
271bp spanning the most informative regions of HVR1 
and HVR2. Interestingly, when using about 200bp from
the information-rich region of HVR1, PCA-SVM yields
a micro-accuracy of over 80%, very close to the micro-accuracy
achieved on this set when using the entire HVR
region, i.e., HVR1+HVR2.

Table 1 Comparison of 5-fold CV accuracy measures (%) on
the trimmed forensic dataset

                              Classification Algorithm
                 # Samples    PCA-QDA   PCA-LDA   1NN      PCA-SVM
Caucasian        1674         83.15     90.2      93.73    94.62
Asian            761          72.93     74.11     83.31    84.76
African          1305         84.6      88.28     86.59    89.81
Hispanic         686          71.57     68.22     72.01    72.59
Micro-Accuracy   4426         80.03     83.46     86.47    88.10
Macro-Accuracy   4426         78.06     80.20     83.91    85.45

Validating SVM on independent test data 

Cross-validation may overestimate the practical perfor- 
mance of classifiers since it ignores potentially signifi- 
cant biases in the assembly of reference databases. To 
obtain a more reliable estimate for the practical accu- 
racy of PCA-SVM, we evaluated its performance using 
the trimmed forensic dataset as training data and the 
trimmed published dataset as test data. Table 2 gives 
the so-called confusion table for this experiment. There
is no "Hispanic" row since there are no samples anno- 
tated as Hispanic in the trimmed published dataset used 
for testing. Since the Hispanic samples are present in 
the trimmed forensic dataset used for training, test sam- 
ples may be mis-classified as Hispanic, and thus we do 
include a "Hispanic" column. PCA-SVM micro-accuracy, 
as well as ethnicity-wise accuracies for the Caucasian 
and African ethnic groups are similar to the cross-validation
results in Table 1. However, ethnicity-wise accuracy
for the Asian group is almost 17 percentage points lower than the
accuracy achieved in the cross-validation experiment.
This is largely explained by large mismatches between
Asian profiles used for training and testing in this
experiment. The 761 Asian profiles in the forensic dataset
used for training come from only 5 countries: China
(356 profiles), Japan (163), Korea (182), Pakistan (8),
and Thailand (52), with a strong bias towards East Asia.
Not surprisingly, a large percentage of misclassification
errors (90 out of the total of 145) are for profiles collected
from two countries (Kazakhstan and Kyrgyzstan)
that are not represented in the training dataset. Profiles
with unknown country of origin are also poorly classified
(10 errors out of 22 samples), suggesting that they
may also come from regions that are poorly represented in
the forensic dataset.

Comparison of methods for handling missing data 

In practice, forensic mtDNA profiles are determined by 
Sanger sequencing of PCR amplicons that span hyper- 
variable regions HVR1 and HVR2. Different laboratories 
use different PCR primer pairs, some of which amplify 
only parts of HVR1 and HVR2. Quality trimming of 
Sanger chromatograms further results in confident poly- 
morphism calls for a (sample dependent) subinterval of 
each amplicon. The end result is mtDNA profiles with
a variable degree of sequence coverage, i.e., with 
unknown polymorphism status for some parts of HVR1 
and/or HVR2. In the experiments reported in previous 
sections we relied on training and test sequences cover- 
ing essentially the same range, so missing data was not 
an issue. In this section we reassess the accuracy of
PCA-SVM under more realistic levels of missing data.
Specifically, we report results of experiments performed
using as training and test data the (untrimmed) forensic
and published datasets, respectively; as shown in Additional
file 1, the published dataset has indeed highly
non-uniform coverage of different HVR regions.

Figure 1 Effects of incomplete data on accuracy. Comparison of PCA-QDA, PCA-LDA, 1NN, and PCA-SVM 5-fold CV micro-accuracy on regions
obtained by iteratively deleting groups of 10% polymorphisms starting from HVR1 towards HVR2 (A), respectively from HVR2 towards HVR1 (B),
and on sliding windows spanning 10% of the nucleotides in HVR1+HVR2 (C).

Table 2 Confusion table of the PCA-SVM test results on
the trimmed published dataset (row percentages)

                              Predicted Ethnicity
True Ethnicity   # Samples    Caucasian   Asian    African   Hispanic
Caucasian        1956         92.59       5.47     1.53      0.41
Asian            450          25.78       67.78    3.11      3.33
African          134          5.22        3.73     87.31     3.73

Micro-Accuracy: 87.91%
Macro-Accuracy: 82.56%

We investigated three different approaches of dealing
with missing data:

• rCRS. In this approach we simply assume that missing
regions are identical to the rCRS. While easy to
implement, this scheme is likely to introduce a strong
bias towards the Caucasian ethnicity since the rCRS
sequence is of a Caucasian.

• Probability. In this approach we augment the feature
encoding scheme described in the Methods section
by adding a set of $l$ additional variables, where $l$ is the
total length of HVR1 and HVR2 in bases. For typed
bases, these variables hold the mutation status of the
base: 1 if there is a polymorphism at this base and 0
otherwise. For bases that are not covered by sequencing,
the corresponding variable is set to a fractional value
between 0 and 1 representing the polymorphism rate
observed at this position in the training data. While less
biased than the rCRS scheme, this scheme may still
introduce unwanted biases in case some ethnicities are
over- or under-represented in the training data.

• Common region. In this approach we compute, for
each test profile, the intersection between the region
sequenced in the test profile and each training sample.
Only these common regions of the training sequences
are then used to infer the ethnicity of the test sample.
The common region approach is computationally more
demanding than the other two, since it may require running
PCA and training a new SVM for each test sample.
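
The three strategies above can be contrasted in a short Python sketch. It is a simplified illustration, not the study's implementation: each profile is assumed to be a dictionary mapping a typed position to 0 or 1 (untyped positions are simply absent), all function names are hypothetical, and `common_region` implements one reading of the text, namely keeping the positions typed in the test profile and in every training profile.

```python
def rcrs_fill(profile, positions):
    """rCRS approach: treat untyped positions as matching the reference (0)."""
    return [profile.get(pos, 0) for pos in positions]


def polymorphism_rates(training_profiles, positions):
    """Fraction of training profiles typed at each position that are polymorphic."""
    rates = {}
    for pos in positions:
        typed = [p[pos] for p in training_profiles if pos in p]
        # Assumption: a position typed in no training profile gets rate 0.0.
        rates[pos] = sum(typed) / len(typed) if typed else 0.0
    return rates


def probability_fill(profile, positions, rates):
    """Probability approach: untyped positions receive the training rate."""
    return [profile.get(pos, rates[pos]) for pos in positions]


def common_region(test_profile, training_profiles):
    """Common region approach: positions typed in the test and all training profiles."""
    shared = set(test_profile)
    for p in training_profiles:
        shared &= set(p)
    return sorted(shared)
```

Under the common region approach the feature set depends on the test profile, which is why PCA and the classifier may need to be refit for each test sample.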

Additional file 4 summarizes the results obtained by 
using the three approaches to handling missing data in
experiments in which the forensic and published datasets
are used for training and for evaluating classification
accuracy, respectively. Consistent with its bias towards
Caucasians, the rCRS approach has almost 97% accuracy
for this ethnicity but much lower accuracy for
Asian and African ethnicities (about 31% and 59%,
respectively), resulting in relatively poor overall micro-
and macro-accuracies. The probability approach is still 
biased towards the Caucasian ethnicity, although less 
strongly than the rCRS approach. The best overall per- 
formance is achieved by the common region approach, 
which has micro- and macro-accuracies (as well as eth- 
nicity-wise accuracies) very close to those observed in 
the experiments performed on the trimmed forensic and 
published datasets (see Table 2). This suggests that the 
common region approach is a good method of dealing 
with missing data, at least in conjunction with the PCA- 
SVM method for ethnicity inference. 

A potential concern with using the common region
approach is that different amounts of training data are 
used in classifying different test samples. This can make 
it difficult to compare posterior probabilities returned 
by classification methods such as SVM, and may partly
explain why, as shown in Additional file 5, SVM posterior
probabilities typically under-estimate the observed
accuracy.

Discussion 

Correspondence between investigator assigned ethnicity 
and mitochondrial haplogroup 

Human mitochondrial haplogroups have arisen from 
mutation and migration during human evolution. As 
such, these haplogroups have been extremely powerful 
tools in understanding human evolution and particularly 
in understanding patterns of geographical migration of 
human populations. Prior to modern travel, mitochon- 
drial haplogroups were largely restricted to the geo- 
graphic regions of their origin and subsequent 
migration. For this reason, they are often superimposed 
on maps of the globe as representative of the human 
populations derived from those regions of the planet. 
Similarly, but more crudely, the coarsest ethnic group- 
ings of humans are also reflective of geographic ances- 
try. Africans, Caucasians, and Asians all have clear 
geographic associations, while Hispanic is often regarded 
as a less well defined mix of New World and European 
ancestry. Because of the clear associations of both mito- 
chondrial haplogroups and ethnic categories with geo- 
graphy, one might naively expect a simple correlation 
between the two classifications. When we analyze the
association between mitochondrial haplogroup and
investigator-assigned ethnicity, however, we find a complex
relationship between the two categories. While, for
instance, there is broad correspondence between the L 
haplogroups and African ethnicity assignments, African 
ethnicity assignments are present to varying degrees in 
virtually every haplogroup analyzed and almost every 
haplogroup contains members of each of the four ethni- 
cities. This is not particularly surprising due to the fact 
that mitochondrial DNA represents only a very small 
segment of the complex mosaic of a human's genetic 
ancestry, and it suggests that the ability to infer coarse 
ethnic identity from mitochondrial sequence would be 
very limited. In fact, however, we find that mitochon- 
drial DNA can be used to infer the probable assignment 
of coarse ethnicity with almost 90% accuracy, levels 
approaching those obtainable with approximately sixty 
autosomal loci [11]. This level of accuracy in predicting 
investigator assigned ethnicity could be very useful in 
forensic investigations. 

Information content in HVR1 and HVR2 

As noted above, there is a great deal of variability in the 
precise regions of HVR1 and HVR2 genotyped in prac- 
tice. Sequence coverage within the mitochondrial con- 
trol region is often laboratory and/or study dependent. 
Variability of these boundaries severely limits the utility 
of individual datasets in the assembly of large datasets
representative of complex populations. Recently, Tzen et 
al. [18] sought to redefine HVR1 on the basis of genetic 
diversity and laboratory tractability. They show that the 
237-bp segment from 16126-16362 (the "redefined" 
HVR1, or rHVRl) had a global genetic diversity of 
0.9905 and the 154-bp segment from 16209-16362 had 
a global diversity of 0.9735, where the genetic diversity 
for a sample with n haplotypes with population frequencies
$x_i$, $i = 1, \ldots, n$, is computed as $\left(1 - \sum_{i=1}^{n} x_i^2\right) n/(n-1)$. The
results of [18] match very closely with our scans of the 
inferential power of windows across the control region; 
Tzen's rHVR1 overlaps precisely with the region of
greatest discriminative power in HVR1. The correspon- 
dence between these results suggests that HVR2 might 
be similarly standardized to the region spanning positions 93-310,
where the greatest discriminative power of HVR2 is 
found. The identification of small regions of sequence 
that have maximal discriminative power could be quite 
useful in forensic and anthropological settings where 
severe degradation can limit the size of PCR products 
recoverable from sample material. Di Bernardo et al. 
[19] report that the longest amplifiable DNA fragments 
extracted from 2000-year-old remains from Pompeii are 
between 139 and 360 bp. Sequences of this size from 
the most informative regions of HVR1 and HVR2 would 
allow inference of coarse ethnic identity with reasonably 
high accuracy. 
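
The diversity statistic quoted above can be illustrated with a minimal Python sketch. It assumes that n counts the sampled sequences, as in the standard unbiased haplotype diversity estimator; that reading, along with the function names `genetic_diversity` and `segment` and the `first_pos` coordinate convention, is an assumption made for illustration and is not taken from [18].

```python
from collections import Counter


def genetic_diversity(haplotypes):
    """Return (1 - sum(x_i^2)) * n / (n - 1) for a list of sampled haplotypes.

    Assumption: n is the number of sampled sequences and x_i is the
    frequency of the i-th distinct haplotype in that sample.
    """
    n = len(haplotypes)
    if n < 2:
        raise ValueError("need at least two sampled sequences")
    counts = Counter(haplotypes)
    sum_sq = sum((c / n) ** 2 for c in counts.values())
    return (1.0 - sum_sq) * n / (n - 1)


def segment(seq, first_pos, start, end):
    """Return the bases at positions start..end (inclusive).

    `first_pos` is the position of seq[0]; this coordinate convention is
    hypothetical and only serves the example.
    """
    return seq[start - first_pos : end - first_pos + 1]


# Toy sample: two copies of one haplotype and two singletons.
sample = ["A", "A", "B", "C"]
diversity = genetic_diversity(sample)

# Positions 16126-16362 inclusive span 237 bases, as stated in the text.
window = segment("N" * 400, 16024, 16126, 16362)
```

In practice each haplotype would be the string returned by `segment` for one sample, so that diversity is computed over a fixed window such as 16126-16362.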

SVM as classifier 

Many applications in human genetics require the discri- 
minative classification of samples into groups, and a 
number of methods for this task have been proposed. 
Lately, machine learning approaches have been used to 
good effect in a number of biological scenarios including 
the classification of Y-haplogroups [20]. In this study we 
use support vector machines (SVM) to develop statisti- 
cal models capable of predicting the ethnicity of mito- 
chondrial DNA samples. We compare the performance 
of SVM under simulations of real-world scenarios with 
several other methods previously proposed for the clas- 
sification of mitochondrial sequences into geographically 
defined groups, including QDA and LDA [3-5]. In all 
tests SVM provides accuracy greater than or equal to that of
the other methods tested. SVM consistently provides
the best accuracy in simulations of degradation from
either end of the mitochondrial hypervariable regions,
and when small subsections of the hypervariable regions
are used. With only 218 bp of mtDNA sequence, the
overall accuracy of SVM predictions exceeds 80%. The 
success of SVM in this classification problem suggests 
that it may also be the best method for related classification
problems including inferring the geographic origin
of DNA samples [4,5], haplogroup membership [8],
drug response profiles [21], and other "race based" therapeutics
[22].
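
The degradation scenarios mentioned above can be mimicked with a short Python sketch. It is not the simulation code used in this study: the `"N"` missing-base marker, the function names, and the `"5prime"`/`"3prime"` labels are assumptions chosen for illustration.

```python
MISSING = "N"  # hypothetical marker for a base that was not recovered


def degrade(seq, lost, end="5prime"):
    """Mask `lost` bases from one end of an aligned sequence.

    The sequence keeps its length so that positions stay aligned with
    the training data; masked bases are replaced by MISSING.
    """
    if not 0 <= lost <= len(seq):
        raise ValueError("lost must be between 0 and len(seq)")
    if end == "5prime":
        return MISSING * lost + seq[lost:]
    if end == "3prime":
        return seq[: len(seq) - lost] + MISSING * lost
    raise ValueError("end must be '5prime' or '3prime'")


def observed_length(seq):
    """Count the bases that survived the simulated degradation."""
    return sum(1 for base in seq if base != MISSING)


full = "ACGTACGT"
from_start = degrade(full, 3)            # first three bases masked
from_end = degrade(full, 3, "3prime")    # last three bases masked
```

Sweeping `lost` over a range of values and re-scoring a classifier at each step would give an accuracy curve against remaining sequence length, which is the kind of comparison described in the text.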

When applied to independent test data our SVM clas- 
sifier performs reasonably well despite significant differ- 
ences between the training and test sets. In particular, 
the absence of a Hispanic classification in the published 
dataset, and the inclusion of geographic regions in the 
test set that are not represented in the training set (for 
instance, Kazakhstan and Kyrgyzstan) are likely to have
contributed significantly to errors in our inferences. 
Such errors are likely to recede as larger, more geogra- 
phically balanced training sets are assembled. 

Handling missing data 

In the last few years several authors have pointed out 
the presence of sequence errors in public and forensic 
mtDNA databases [23-27]. Moreover, precise boundaries 
of HVR1 and HVR2 are not always consistent across 
studies and real-world samples may be severely 
degraded, further contributing to errors or missing data 
in samples to be classified. We compared several statistical
approaches to dealing with missing data and
assessed their accuracy under simulated
scenarios of data dropout or loss. We found that despite 
a small loss of accuracy incurred by data dropout, 
restricting analysis to the region of intersection between 
the test sample and training samples provides the most 
reliable inference of the ethnicity of the sample. 
Attempts to impute any missing data based on the rCRS 
or a probabilistic model based on the training set
resulted in prediction bias toward Caucasian due to the 
origin of the rCRS and the preponderance of Caucasian 
samples in the FBI forensic data set. Until very large, 
ethnically balanced training sets are available, restricting 
analysis to the region of intersection between test and 
training samples is likely to remain the most accurate 
and unbiased approach to inference. 
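
The intersection strategy recommended above can be sketched in a few lines of Python. The representation used here (a start position plus an aligned string, with `"N"` marking missing bases) and all function names are assumptions for illustration, not the data format used in this study.

```python
def covered_positions(first_pos, seq, missing="N"):
    """Return the set of positions at which a sample has an observed base."""
    return {first_pos + i for i, base in enumerate(seq) if base != missing}


def shared_positions(test_cov, training_covs):
    """Positions observed in the test sample and in every training sample."""
    shared = set(test_cov)
    for cov in training_covs:
        shared &= cov
    return sorted(shared)


def restrict(first_pos, seq, positions):
    """Keep only the bases of `seq` at the given positions."""
    return "".join(seq[p - first_pos] for p in positions)


# Toy aligned data; position 1 corresponds to index 0 of each string.
training = ["ACGTAC", "ACGTNN"]
test = "NNGTAC"

shared = shared_positions(
    covered_positions(1, test),
    [covered_positions(1, s) for s in training],
)
test_features = restrict(1, test, shared)
training_features = [restrict(1, s, shared) for s in training]
```

Because no base is imputed, the classifier is retrained and applied only on positions that every sample actually covers, which avoids pulling the prediction toward the reference sequence or the majority class.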

Conclusions 

In this study, we compared four classification algo- 
rithms for the prediction of probable assignment of 
coarse ethnic identity using short DNA sequences 
from the hypervariable region of mtDNA. Comprehen- 
sive empirical studies showed that, regardless of 
sequence length, support vector classification is the 
most accurate classifier among those compared and 
approaches 90% accuracy in predicting the assignment 
of coarse ethnic identity. Our experiments also identified
high accuracy segments in HVR, which agree well
with the genetically diverse regions reported in pre- 
vious work. Finally, our experiments showed that, in 
dealing with missing data, it is advisable to use only 
segments shared by reference sequences and the 
sequence under test. 

Additional material



Additional file 1: Coverage of samples. Percentage of samples
covering each position of HVR1 and HVR2 in the forensic (A) and
published (B) datasets.

Additional file 2: Sample composition of the forensic and published
datasets. Ethnicity composition of each haplogroup (A) and haplogroup
composition of each ethnic group (B) for the forensic and published
datasets.

Additional file 3: Accuracy of short segments of HVR. Comparison of
PCA-QDA, PCA-LDA, 1NN, and PCA-SVM 5-fold CV micro-accuracy on 6
selected windows of 165-271 bp spanning the most informative regions
of HVR1 and HVR2.

Additional file 4: Accuracy of PCA-SVM using different schemes for 
handling missing data 

Additional file 5: Calibration of PCA-SVM posterior probabilities for
the FBI published dataset. The actual accuracy rates are slightly higher
than the estimated posterior probabilities.